IntlTokenize

Header:	Script.h		Carbon status:	Supported

Allows your application to convert text into a sequence of language-independent tokens.

TokenResults IntlTokenize (
    TokenBlockPtr tokenParam
);

Parameter descriptions

tokenParam: A pointer to a token block structure, TokenBlock. The structure specifies the text to be converted to tokens, the destination of the token list, a handle to the tokens ('itl4') resource, and a set of options.

Function result

IntlTokenize returns a result code of type TokenResults that specifies the type of error that occurred, if any.

The list of tokens that correspond to the input text is returned through the token block, in the buffer pointed to by the tokenList field. The token list is an array of token structures (type TokenRec). Each token structure describes the token generated, specifies the part of the source text it came from, and optionally provides a character string that is a normalized version of the text that generated the token.

DISCUSSION

The token block structure is a parameter block. The relevant fields of the parameter block are:

source

A value of type Ptr. On input, a pointer to the beginning of the source text (not a Pascal string) to be converted.
sourceLength

A value of type LongInt. On input, the number of bytes in the source text.
tokenList

A value of type Ptr. On input, a pointer to a buffer you have allocated. On output, a pointer to a list of token structures generated by the IntlTokenize function.
tokenLength

A value of type LongInt. On input, the maximum size of the token list (in number of tokens, not bytes) that will fit into the buffer pointed to by the tokenList field.
tokenCount

A value of type LongInt. On input (if doAppend = TRUE), must contain the correct number of tokens currently in the token list. (Ignored if doAppend = FALSE.) On output, the number of tokens currently in the token list.
stringList

A value of type Ptr. On input (if doString = TRUE), a pointer to a buffer you have allocated. (Ignored if doString = FALSE.) On output, a pointer to a list of strings generated by the IntlTokenize function.
stringLength

A value of type LongInt. On input (if doString = TRUE), the size in bytes of the string list buffer pointed to by the stringList field. (Ignored if doString = FALSE.)
stringCount

A value of type LongInt. On input (if doString = TRUE and doAppend = TRUE), the correct current size in bytes of the string list. (Ignored if doString = FALSE or doAppend = FALSE.) On output, the current size in bytes of the string list. (Indeterminate if doString = FALSE.)
doString

A value of type Boolean. On input, if TRUE, instructs IntlTokenize to create a Pascal string representing the contents of each token it generates. If FALSE, IntlTokenize generates a token list without an associated string list.
doAppend

A value of type Boolean. On input, if TRUE, instructs IntlTokenize to append tokens and strings it generates to the current token list and string list. If FALSE, IntlTokenize writes over any previous contents of the buffers pointed to by tokenList and stringList.
doAlphanumeric

A value of type Boolean. On input, if TRUE, instructs IntlTokenize to interpret numeric characters as alphabetic when mixed with alphabetic characters. If FALSE, all numeric characters are interpreted as numbers.
doNest

A value of type Boolean. On input, if TRUE, instructs IntlTokenize to allow nested comments (to any depth of nesting). If FALSE, comment delimiters may not be nested within other comment delimiters.
leftDelims

A value of type DelimType. On input, an array of two integers, each of which contains the token code of the symbol that may be used as an opening delimiter for a quoted literal. If only one opening delimiter is needed, the other must be specified to be delimPad.
rightDelims

A value of type DelimType. On input, an array of two integers, each of which contains the token code of the symbol that may be used as the matching closing delimiter for the corresponding opening delimiter in the leftDelims field.
leftComment

A value of type CommentType. On input, an array of two pairs of integers, each pair of which contains codes for the two token types that may be used as opening delimiters for comments.
rightComment

A value of type CommentType. On input, an array of two pairs of integers, each pair of which contains codes for the two token types that may be used as closing delimiters for comments.
escapeCode

A value of type TokenType. On input, a single integer that contains the token code for the symbol that may be an escape character within a quoted literal.
decimalCode

A value of type TokenType. On input, a single integer that contains the token type of the symbol to be used for a decimal point.
itlResource

A value of type Handle. On input, a handle to the tokens ('itl4') resource of the script system under which the source text was created.
reserved

An 8-byte array of type LongInt. On input, must be set to 0.

Before calling the IntlTokenize function, allocate memory for and set up the following data structures:

A token block structure (data type TokenBlock). The token block structure is a parameter block that holds both input and output parameters for the IntlTokenize function.
A token list to hold the results of the tokenizing operation. To set up the token list, estimate how many tokens will be generated from your text, multiply that by the size of a token structure, and allocate a memory block of that size in bytes. An upper limit to the possible number of tokens is the number of characters in the source text.
A string list, if you want the IntlTokenize function to generate character strings for all the tokens. To set up the string list, multiply the estimated number of tokens by the expected average size of a string, and allocate a memory block of that size in bytes. An upper limit is twice the number of tokens plus the number of bytes in the source text.
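
As a rough illustration of the sizing rules above, the following C sketch computes worst-case buffer sizes. SketchTokenRec is only a stand-in for TokenRec (the real declaration is in Script.h and its layout may differ), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for TokenRec; assumed layout, used here only for sizeof. */
typedef struct {
    short          theToken;        /* token code */
    char          *position;        /* start of the token's source text */
    long           length;          /* length of that source text, in bytes */
    unsigned char *stringPosition;  /* normalized Pascal string, if requested */
} SketchTokenRec;

/* Upper limit on the number of tokens: one token per source character. */
long max_token_count(long sourceChars)
{
    assert(sourceChars >= 0);
    return sourceChars;
}

/* Bytes to allocate for a token list that holds tokenCapacity tokens. */
size_t token_list_bytes(long tokenCapacity)
{
    assert(tokenCapacity >= 0);
    return (size_t)tokenCapacity * sizeof(SketchTokenRec);
}

/* Upper limit on the string list size:
   twice the number of tokens plus the number of bytes in the source text. */
long max_string_list_bytes(long tokenCount, long sourceBytes)
{
    assert(tokenCount >= 0 && sourceBytes >= 0);
    return 2 * tokenCount + sourceBytes;
}
```

Note that the value stored in the tokenLength field is the token capacity itself, not the byte count returned by token_list_bytes.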

IntlTokenize creates tokens based on information in the tokens ('itl4') resource of the script system under which the source text was created. You must load the tokens resource and place its handle in the token block structure before calling the IntlTokenize function.

The token block structure contains both input and output values. On input, you must provide values for the fields that specify the source text location, the token list location, the size of the token list, the tokens ('itl4') resource to use, and several options that affect the operation. You must set reserved locations to 0 before calling IntlTokenize.

On output, the token block structure specifies how many tokens have been generated and the size of the string list (if you have selected the option to generate strings).
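
The input-side setup described above might look like the following sketch. SketchTokenBlock mirrors the fields listed in this section but is not the declaration from Script.h (field types and the size of reserved are assumptions), and init_token_block is a hypothetical helper. Clearing the whole structure first is one simple way to guarantee that the reserved locations are 0.

```c
#include <assert.h>
#include <string.h>

typedef short SketchTokenType;          /* stand-in for TokenType */

/* Stand-in for TokenBlock; assumed layout, for illustration only. */
typedef struct {
    char           *source;
    long            sourceLength;
    void           *tokenList;
    long            tokenLength;
    long            tokenCount;
    char           *stringList;
    long            stringLength;
    long            stringCount;
    unsigned char   doString;
    unsigned char   doAppend;
    unsigned char   doAlphanumeric;
    unsigned char   doNest;
    SketchTokenType leftDelims[2];
    SketchTokenType rightDelims[2];
    SketchTokenType leftComment[4];
    SketchTokenType rightComment[4];
    SketchTokenType escapeCode;
    SketchTokenType decimalCode;
    void          **itlResource;
    long            reserved[8];        /* placeholder size; must be 0 */
} SketchTokenBlock;

/* Fill in the required input fields for a call that does not
   generate strings and does not append to earlier results. */
void init_token_block(SketchTokenBlock *tb,
                      char *text, long textBytes,
                      void *tokenBuf, long tokenCapacity,
                      void **itl4Handle)
{
    assert(tb != NULL && text != NULL && tokenBuf != NULL);

    memset(tb, 0, sizeof *tb);          /* zeroes reserved and all flags */

    tb->source       = text;            /* raw text, not a Pascal string */
    tb->sourceLength = textBytes;
    tb->tokenList    = tokenBuf;
    tb->tokenLength  = tokenCapacity;   /* in tokens, not bytes */
    tb->itlResource  = itl4Handle;      /* handle to the 'itl4' resource */

    /* The delimiter, comment, escape, and decimal fields are left
       for the caller to set before the call. */
}
```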

The results of the tokenizing operation are contained in the token list, an array of token structures (data type TokenRec).

Pascal strings are generated if the doString parameter in the token block structure is set to TRUE. The string is a normalized version of the source text that generated the token; alternate digits are replaced with ASCII numerals, the decimal point is always an ASCII period, and 2-byte Roman letters are replaced with low-ASCII equivalents.

To make a series of calls to IntlTokenize and append the results of each call to the results of previous calls, set doAppend to FALSE and initialize tokenCount and stringCount to 0 before making the first call to IntlTokenize. (You can ignore stringCount if you set doString to FALSE.) Upon completion of the call, tokenCount and stringCount will contain the number of tokens and the length in bytes of the string list, respectively, generated by the call. On subsequent calls, set doAppend to TRUE, reset the source and sourceLength parameters (and any other parameters as appropriate) for the new source text, but maintain the output values for tokenCount and stringCount from each call as input values to the next call. At the end of your sequence of calls, the token list and string list will contain, in order, all the tokens and strings generated from the calls to IntlTokenize.
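
The bookkeeping for such a sequence of calls can be sketched as follows. This is only an illustration under stated assumptions: SketchAppendBlock contains just the fields relevant to appending, tokenize_runs is a hypothetical helper, and the tokenizer is passed in as a function pointer so that the sketch does not depend on Script.h. fake_tokenize is an invented stand-in that counts one token per byte; it does not model the behavior of IntlTokenize.

```c
#include <assert.h>
#include <stddef.h>

/* Only the token block fields that matter for appending. */
typedef struct {
    const char   *source;
    long          sourceLength;
    long          tokenCount;
    long          stringCount;
    unsigned char doString;
    unsigned char doAppend;
} SketchAppendBlock;

/* Tokenizer callback; returns 0 on success, nonzero on error. */
typedef int (*SketchTokenizeFn)(SketchAppendBlock *tb);

/* Tokenize several pieces of text into one token list and string list. */
int tokenize_runs(SketchAppendBlock *tb,
                  const char *const *runs, const long *runBytes,
                  size_t runCount, SketchTokenizeFn tokenize)
{
    size_t i;

    assert(tb != NULL && tokenize != NULL);

    tb->doAppend    = 0;                /* first call: overwrite */
    tb->tokenCount  = 0;
    tb->stringCount = 0;

    for (i = 0; i < runCount; i++) {
        int err;

        tb->source       = runs[i];     /* new source text for this call */
        tb->sourceLength = runBytes[i];

        err = tokenize(tb);
        if (err != 0)
            return err;

        /* Later calls append. tokenCount and stringCount are left
           untouched so each call's output is the next call's input. */
        tb->doAppend = 1;
    }
    return 0;
}

/* Invented stand-in used only to exercise the loop. */
int fake_tokenize(SketchAppendBlock *tb)
{
    if (!tb->doAppend) {
        tb->tokenCount  = 0;
        tb->stringCount = 0;
    }
    tb->tokenCount += tb->sourceLength;
    if (tb->doString)
        tb->stringCount += 2 * tb->sourceLength;
    return 0;
}
```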

If you are making tokens from text that was created under more than one script system, you must load the proper tokens resource and place its handle in the token block structure separately for each script run in the text, appending the results each time.

Delimiters for quoted literals are passed to IntlTokenize in two-integer arrays of type DelimType.

The individual delimiters, as specified in the leftDelims and rightDelims parameters, are paired by position. The first (in storage order) opening delimiter in leftDelims is paired with the first closing delimiter in rightDelims.
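
Pairing by position can be illustrated with a small lookup. The token code names and numeric values below are invented placeholders for this sketch (the real codes, including delimPad, come from Script.h), and closing_delim_for is a hypothetical helper, not part of the Script Manager.

```c
#include <assert.h>

typedef short SketchTokenType;          /* stand-in for TokenType */

/* Placeholder codes; NOT the values defined in Script.h. */
enum {
    SKETCH_DELIM_PAD        = -100,
    SKETCH_LEFT_QUOTE       = 1,
    SKETCH_RIGHT_QUOTE      = 2,
    SKETCH_LEFT_SING_QUOTE  = 3,
    SKETCH_RIGHT_SING_QUOTE = 4
};

/* Return the closing delimiter paired with `opening`: the entry of
   `right` at the same index at which `opening` appears in `left`.
   Returns SKETCH_DELIM_PAD if `opening` is not an opening delimiter. */
SketchTokenType closing_delim_for(const SketchTokenType left[2],
                                  const SketchTokenType right[2],
                                  SketchTokenType opening)
{
    int i;

    assert(opening != SKETCH_DELIM_PAD);

    for (i = 0; i < 2; i++) {
        if (left[i] == opening)
            return right[i];
    }
    return SKETCH_DELIM_PAD;
}
```

With left set to { SKETCH_LEFT_QUOTE, SKETCH_LEFT_SING_QUOTE } and right set to { SKETCH_RIGHT_QUOTE, SKETCH_RIGHT_SING_QUOTE }, the second opening delimiter pairs with the second closing delimiter. If only one pair is needed, the unused slot in each array holds the pad value.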

Comment delimiters may be 1 or 2 tokens each and there may be two sets of opening and closing pairs. They are passed to IntlTokenize in a CommentType array.

If only one token is needed for a delimiter, the second token must be specified to be delimPad. If only one delimiter of an opening-closing pair is needed, then both of the tokens allocated for the other symbol must be delimPad. The first token of a two-token sequence is at the higher position in the leftComment or rightComment array.

When IntlTokenize encounters an escape character within a quoted literal, it places the portion of the literal before the escape character into a single token (of type tokenLiteral), places the escape character into another token (tokenEscape), places the character following the escape character into another token (whatever token type it corresponds to), and places the portion of the literal following the escape sequence into another token (tokenLiteral). Outside of a quoted literal, the escape character has no special significance.

IntlTokenize considers the character specified in the decimalCode parameter to be a decimal character only when it is flanked by numeric or alternate numeric characters, or when it follows them.

SPECIAL CONSIDERATIONS

IntlTokenize may move memory; your application should not call this function at interrupt time.

Because each call to IntlTokenize must be for a single script run, there can be no change of script within a comment or quoted literal.

Comments and quoted literals must be complete within a single call to IntlTokenize in order to avoid syntax errors.

IntlTokenize always uses the tokens resource whose handle you pass it in the token block structure. Therefore, it is not directly affected by the state of the font force flag or the international resources selection flag. However, if you use the GetIntlResource function to get a handle to the tokens resource to pass to IntlTokenize, remember that GetIntlResource is affected by the state of the international resources selection flag.

AVAILABILITY

Supported in Carbon. Available in Carbon 1.0.2 and later when running Mac OS 8.1 or later.